Semi-Automated Resolution of Inconsistency for a Harmonized Multiword Expression and Dependency Parse Annotation

نویسندگان

  • King Chan
  • Julian Brooke
  • Timothy Baldwin
چکیده

This paper presents a methodology for identifying and resolving various kinds of inconsistency in the context of merging dependency and multiword expression (MWE) annotations, to generate a dependency treebank with comprehensive MWE annotations. Candidates for correction are identified using a variety of heuristics, including an entirely novel one which identifies violations of MWE constituency in the dependency tree, and resolved by arbitration with minimal human intervention. Using this technique, we identified and corrected several hundred errors across both parse and MWE annotations, representing changes to a significant percentage (well over 10%) of the MWE instances in the joint corpus.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

HamleDT: To Parse or Not to Parse?

We propose HamleDT – HArmonized Multi-LanguagE Dependency Treebank. HamleDT is a compilation of existing dependency treebanks (or dependency conversions of other treebanks), transformed so that they all conform to the same annotation style. While the license terms prevent us from directly redistributing the corpora, most of them are easily acquirable for research purposes. What we provide inste...

متن کامل

Annotation of Multiword Expressions in the Prague Dependency Treebank

We describe annotation of multiword expressions in the Prague Dependency Treebank, using several automatic pre-annotation steps. We use subtrees of the tectogrammatical tree structures of the Prague dependency treebank to store representations of the multiword expressions in the dictionary and pre-annotate following occurrences automatically. We also show a way to measure reliability of this ty...

متن کامل

Identification of Flexible Multiword Expressions with the Help of Dependency Structure Annotation

This paper presents our ongoing work on compilation of English multi-word expression (MWE) lexicon and corpus annotation. We are especially interested in collecting flexible MWEs, in which some other constituents can intervene the expression such as “a number of” vs “a large number of” where a modifier of “number” can be placed in the expression while inheriting the original meaning. We first c...

متن کامل

An annotation scheme for Persian based on Autonomous Phrases Theory and Universal Dependencies

A treebank is a corpus with linguistic annotations above the level of the parts of speech. During the first half of the present decade, three treebanks have been developed for Persian either originally or subsequently based on dependency grammar: Persian Treebank (PerTreeBank), Persian Syntactic Dependency Treebank, and Uppsala Persian Dependency Treebank (UPDT). The syntactic analysis of a sen...

متن کامل

Use of Coreference in Automatic Searching for Multiword Discourse Markers in the Prague Dependency Treebank

The paper introduces a possibility of new research offered by a multi-dimensional annotation of the Prague Dependency Treebank. It focuses on exploitation of the annotation of coreference for the annotation of discourse relations expressed by multiword expressions. It tries to find which aspect interlinks these linguistic areas and how we can use this interplay in automatic searching for Czech ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2017